Spatial Ecology and Macroecology

Practical - Week 4 Pt.2

Carmen Soria & François Leroy

(Department of Spatial Sciences)

2022-11-13

What are we going to see today?

Species Distribution Modelling using MaxEnt

  1. Maximum entropy principle
  2. Maxent in SDMs
  3. Maxent - Data preparation
  4. Maxent - Modelling

Maximum Entropy

Maximum Entropy Principle

  • Entropy is a measure of uncertainty or disorder in a system.

  • The principle of maximum entropy states that the most appropriate distribution to model a given set of data is the one with highest entropy among all those that satisfy the constraints of our prior knowledge.

  • It allows you to find the distribution that makes the fewest assumptions about your data (the one with maximal information entropy)

Maxent SDMs

  • Maxent (Maximum Entropy) is a statistical modeling approach based on the Maximum Entropy principle.

  • It finds the probability distribution that maximizes entropy, subject to the constraints imposed by the available information (e.g. environmental variables).

  • It requires presence data (presence-background model) and it is widely used for SDMs.

  • It provides estimates of relative suitability, not occupancy probability.

Maxent SDMs

  • “MaxEnt takes a list of species presence locations as input, often called presence-only (PO) data, as well as a set of environmental predictors (e.g. precipitation, temperature) across a user-defined landscape that is divided into grid cells. From this landscape, Maxent extracts a sample of background locations that it contrasts against the presence locations. Presence is unknown at background locations” Merrow et al. (2013)
  • “Maxent uses PB data to obtain a picture of environmental characteristics at presence sites and at background locations (which are a random or regular sample of the landscape, or could be targeted to match known biases in the sampling process)” Guillera-Arroitia et al. (2014)

EXERCISE SDM using Maxent for the Red admiral butterfly (Vanessa atalanta)

1. Conceptualisation

  • Taxa: Red admiral butterfly (Vanessa atalanta)
  • Objective: find climatically suitable areas
  • Biodiversity data: GBIF observations
  • Predictors: Climatic predictors
  • Location: Czechia

2. Data preparation

2.1 Getting species observations (Practical 1)

pacman::p_load(rgbif, tidyverse, rnaturalearth, rnaturalearthdata, sf)

Variables that we will use

taxa <- "Vanessa atalanta"
country_code <- "CZ"
proj_crs <- 4326 # EPSG code for WGS84

Base map of our area of study

2.1 Getting species observations (Practical 1)

Getting the number of occurrence records

[1] 1143

Download occurrence data using rgbif

2.2 Checking and cleaning the data

We inspect the downloaded occurrences plotting them in a world map using rnaturalearth, sf and ggplot2

2.2 Checking and cleaning the data

Clean the data using CoordinateCleaner

Plot the data to check if it makes sense after cleaning

2.3 Getting climatic data (Practical 3)

Getting WorldClim variables using the package geodata for our area of study (Czechia)

2.4 Spatial thinning

Keeping only one observation per grid cell. We can use spatSample from the terra package or thin from the spThin package.

Plotting the thinned data to check if it makes sense

2.5 Generating background data

To generate the background data, we sample random points from our area of study using the sampleRandom function from the raster package.

2.5 Extracting climate data

Extract the climatic data for our presences and background points using extract from the terra package.

2.7 Testing for collinearity in the predictors and select predictors

Identifying variables that have a correlation of over 70%

2.7 Testing for collinearity in the predictors and select predictors

Selecting the variables in our data

# Selecting the most relevant and uncorrelated variables
my_preds <- c( "wc2.1_30s_bio_1", # Annual mean temperature
               "wc2.1_30s_bio_2", # Mean diurnal range
               "wc2.1_30s_bio_4", # Temperature seasonality
               "wc2.1_30s_bio_9", # Mean temperature of driest quarter
               "wc2.1_30s_bio_17", # Precipitation of driest quarter
               "wc2.1_30s_bio_12") #Annual precipitation


# Filtering the relevant columns of the data
vanat_data <- vanat_obs_clim %>% 
  dplyr::select(any_of(my_preds))
bg_data <- bg_clim %>% 
  dplyr::select(any_of(my_preds))

2.8 Separating the data into training and testing

set.seed(1312)

# Divide randomly intro 70 and 30
train_i <- sample(seq_len(nrow(vanat_data)), size=round(0.7*nrow(vanat_data)))

# Subset the training and testing data
vanat_train <- vanat_data[train_i,]
vanat_test <- vanat_data[-train_i,]

2.9 Renaming variables, joining presence and background points

Renaming the variables

p <- vanat_train
p_test <- vanat_test
a <- bg_data

Joining presence and background data. Maxent identifies presences as 1 and background points as 0.

# Vector of presences and background points
pa <- c(rep(1, nrow(p)), rep(0, nrow(a)))

# Dataframe of the environmental conditions of presences and background points
pder <- as.data.frame(rbind(p, a))

3. Fitting the Maxent

We will fit the Maxent model using the dismo package

pacman::p_load(dismo)
mod <- maxent(x = pder, p = pa)

Plotting the relative variable contribution.

Which is the variable that contributes the most?

plot(mod)

Viewing the results of the model

                                                                                        [,1]
X.Training.samples                                                                  550.0000
Regularized.training.gain                                                             0.1169
Unregularized.training.gain                                                           0.1411
Iterations                                                                          380.0000
Training.AUC                                                                          0.6486
X.Background.points                                                                1334.0000
wc2.1_30s_bio_1.contribution                                                          4.6424
wc2.1_30s_bio_12.contribution                                                        78.8980
wc2.1_30s_bio_17.contribution                                                         4.6743
wc2.1_30s_bio_2.contribution                                                          5.9832
wc2.1_30s_bio_4.contribution                                                          1.5785
wc2.1_30s_bio_9.contribution                                                          4.2235
wc2.1_30s_bio_1.permutation.importance                                               19.0363
wc2.1_30s_bio_12.permutation.importance                                              65.6388
wc2.1_30s_bio_17.permutation.importance                                               0.2608
wc2.1_30s_bio_2.permutation.importance                                                5.8696
wc2.1_30s_bio_4.permutation.importance                                                8.8200
wc2.1_30s_bio_9.permutation.importance                                                0.3746
Entropy                                                                               7.0786
Prevalence..average.probability.of.presence.over.background.sites.                    0.5521
Fixed.cumulative.value.1.cumulative.threshold                                         1.0000
Fixed.cumulative.value.1.Cloglog.threshold                                            0.3309
Fixed.cumulative.value.1.area                                                         0.9730
Fixed.cumulative.value.1.training.omission                                            0.0091
Fixed.cumulative.value.5.cumulative.threshold                                         5.0000
Fixed.cumulative.value.5.Cloglog.threshold                                            0.3789
Fixed.cumulative.value.5.area                                                         0.8928
Fixed.cumulative.value.5.training.omission                                            0.0382
Fixed.cumulative.value.10.cumulative.threshold                                       10.0000
Fixed.cumulative.value.10.Cloglog.threshold                                           0.4252
Fixed.cumulative.value.10.area                                                        0.8066
Fixed.cumulative.value.10.training.omission                                           0.0800
Minimum.training.presence.cumulative.threshold                                        0.2935
Minimum.training.presence.Cloglog.threshold                                           0.2570
Minimum.training.presence.area                                                        0.9895
Minimum.training.presence.training.omission                                           0.0000
X10.percentile.training.presence.cumulative.threshold                                12.6374
X10.percentile.training.presence.Cloglog.threshold                                    0.4468
X10.percentile.training.presence.area                                                 0.7661
X10.percentile.training.presence.training.omission                                    0.1000
Equal.training.sensitivity.and.specificity.cumulative.threshold                      40.2325
Equal.training.sensitivity.and.specificity.Cloglog.threshold                          0.5374
Equal.training.sensitivity.and.specificity.area                                       0.4033
Equal.training.sensitivity.and.specificity.training.omission                          0.4036
Maximum.training.sensitivity.plus.specificity.cumulative.threshold                   48.0194
Maximum.training.sensitivity.plus.specificity.Cloglog.threshold                       0.5964
Maximum.training.sensitivity.plus.specificity.area                                    0.3208
Maximum.training.sensitivity.plus.specificity.training.omission                       0.4636
Balance.training.omission..predicted.area.and.threshold.value.cumulative.threshold    0.2935
Balance.training.omission..predicted.area.and.threshold.value.Cloglog.threshold       0.2570
Balance.training.omission..predicted.area.and.threshold.value.area                    0.9895
Balance.training.omission..predicted.area.and.threshold.value.training.omission       0.0000
Equate.entropy.of.thresholded.and.original.distributions.cumulative.threshold         5.2220
Equate.entropy.of.thresholded.and.original.distributions.Cloglog.threshold            0.3809
Equate.entropy.of.thresholded.and.original.distributions.area                         0.8891
Equate.entropy.of.thresholded.and.original.distributions.training.omission            0.0400

Predict function

Predicting the relative suitability of the territory for our taxa based on our results

pred1 <- predict(mod, bioclim_data)  

Model evaluation

We evaluate the models using the evaluate function from the dismo package for the training and testing data.

mod_eval_train <- dismo::evaluate(p = p, a = a, model = mod)
mod_eval_test <- dismo::evaluate(p = p_test, a = a, model = mod)
# Evaluation of the training model
print(mod_eval_train)
class          : ModelEvaluation 
n presences    : 550 
n absences     : 786 
AUC            : 0.7526671 
cor            : 0.4630484 
max TPR+TNR at : 0.5962506 
# Evaluation of the testing model
print(mod_eval_test) 
class          : ModelEvaluation 
n presences    : 236 
n absences     : 786 
AUC            : 0.7095247 
cor            : 0.3704079 
max TPR+TNR at : 0.634419 

Response curves

How each environmental variable affects the Maxent prediction. They show how the logistic prediction changes as each environmental variable is varied, keeping all other environmental variables at their average sample value.

Resources

  • NIMBioS Ecological Niche Modelling course (website and github)

  • Other resources mentioned in the previous SDM session

Any questions?